3 Importing Data
3.1 Introduction
In this chapter, we discuss the main topics regarding data imports. Understanding how to effectively import data is a crucial skill for data analysis, as it forms the foundation upon which all subsequent analysis is built. As such, the principles we will cover are broadly applicable across various software environments.
3.2 Spreadsheets and File Types
Data sets are stored in many different formats. Probably the
most common is the tabular form, or electronic
spreadsheet (e.g., Excel format). A spreadsheet is
similar to a data frame or matrix, because it consists of rows and
columns. The type of file determines how we import it into R. Common
file types include Excel workbooks, CSV files, and text files with
specific delimiters like tabs or semicolons. For instance, a CSV
file (Comma-Separated Values) uses commas to separate values within
each row. Understanding these file types is crucial because the file
type determines which function or package we use to read the data
into R. For instance, this is how the file customer_churn looks when
opened in a text editor:
The first row contains headers, which might appear wrapped due to length but, in terms of structure, they are still a single row. By understanding the file type and structure, we can accurately import our data in R.
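As a minimal sketch of how the delimiter determines the import function, the code below writes the same small table to three temporary files (comma-, tab-, and semicolon-separated) and reads each one back with base R. The three rows are taken from the first rows of the customer_churn data shown in this chapter; the temporary file names are generated only for illustration.

```r
# A small table using three columns and rows from the customer_churn data
toy <- data.frame(
  ID = 1:3,
  Recency = c(46L, 40L, 35L),
  Frequency_Level = c("Low", "Medium", "High")
)

# Temporary files, created only for this illustration
comma_file <- tempfile(fileext = ".csv")
tab_file   <- tempfile(fileext = ".txt")
semi_file  <- tempfile(fileext = ".txt")

write.csv(toy, comma_file, row.names = FALSE)               # comma-separated
write.table(toy, tab_file, sep = "\t", row.names = FALSE)   # tab-separated
write.table(toy, semi_file, sep = ";", row.names = FALSE)   # semicolon-separated

# Each delimiter is matched with a suitable import call
from_comma <- read.csv(comma_file)                           # sep = "," by default
from_tab   <- read.delim(tab_file)                           # sep = "\t" by default
from_semi  <- read.table(semi_file, header = TRUE, sep = ";")

# Inspect the structure of one of the imported data frames
str(from_comma)
```

Whatever the delimiter, the imported object is a data frame with the same rows and columns; only the function (or its `sep` argument) changes.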
3.3 Paths and the working directory
Besides the file type, we need to know the
path of a file. The path of a file essentially
denotes where the file is stored. Usually, we keep
files in organized folders, which are called
directories. Although the names may not be very
intuitive, the important thing to remember is that, to import a file
in RStudio, we need to know its type and where it is. To understand
the terminology, suppose we have a csv file called
customer_churn in a folder called
Data Sets. A possible path in that case would be
C:/Users/User/Desktop/Data Sets/customer_churn.csv.
Let’s break it down:
- Full path: C:/Users/User/Desktop/Data Sets/customer_churn.csv
- Directory path: C:/Users/User/Desktop/Data Sets
- Directory: Data Sets
- File: customer_churn.csv
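The same breakdown can be reproduced with base R's path helpers. This is only a sketch: dirname(), basename(), and file.path() operate on the text of the path, so the file does not need to exist on the computer for the code to run.

```r
full_path <- "C:/Users/User/Desktop/Data Sets/customer_churn.csv"

# Split the full path into the parts listed above
directory_path <- dirname(full_path)        # "C:/Users/User/Desktop/Data Sets"
directory      <- basename(directory_path)  # "Data Sets"
file_name      <- basename(full_path)       # "customer_churn.csv"

# Rebuild the full path from the directory path and the file name
rebuilt_path <- file.path(directory_path, file_name)
identical(rebuilt_path, full_path)          # TRUE
```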
So, by using the full path, we can import a data set in R. To see
how this works in practice, we implement what we just described. The
built-in function to import a csv file in R is the
read.csv() function. Inside the function we specify the
full path and we can store the data in an object directly. In our
example, we give the name churn_data to the object in
which we want to store the imported data.
# Import the data
churn_data <- read.csv("C:/Users/User/Desktop/Data Sets/customer_churn.csv")
# Print the first 6 rows
head(churn_data)
When we work with R, though, we are always located “somewhere” on the
computer. In other words, R assumes a specific directory path from
which we work. This is called our
working directory. When a file is stored in the working directory,
there is no need to specify the full path every time we import a data
set; we can simply use the file name instead of the full path inside
the function. Before we see how this works, let’s check our current
working directory. For this, we can use the function
getwd():
# Get Working Directory
getwd()
[1] "C:/Users/User/Document"
We see that our working directory is
C:/Users/User/Document. To change the working directory, we
can use the function setwd(). For instance, suppose we
want to change the working directory from
C:/Users/User/Document to
C:/Users/User/Desktop/Data Sets. To do this, we enter the
desired directory path inside the parentheses.
# Change Working Directory
setwd("C:/Users/User/Desktop/Data Sets")
Now, if we call getwd() again, we see that our
working directory has changed.
# Get Working Directory
getwd()
[1] "C:/Users/User/Desktop/Data Sets"
As our working directory is now the Data Sets directory, we
can use the read.csv() function by supplying only the
name of the file inside the parentheses:
# Import the data
churn_data <- read.csv("Customer_Churn.csv")
# Print the first 6 rows
head(churn_data)
ID Recency Recency_Level Frequency Frequency_Level Monetary_Value
1 1 46 Low 26 Low 3009.60
2 2 40 Low 56 Medium 57347.28
3 3 35 Low 293 High 14496.16
4 4 50 Low 18 Low 1416.20
5 5 77 Medium 14 Low 523.72
6 6 55 Medium 39 Low 8830.32
Monetary_Value_Level Observation_Period Churn
1 Medium 742 0
2 High 2301 0
3 High 2411 0
4 Medium 813 0
5 Low 1 0
6 Medium 2077 0
We see that the data import occurs successfully. In this way, we can import different data sets quite efficiently. Another advantage is that, when we share our R script, our code is more readable and other people can easily run the script under the assumption that their working directory contains the same data set. Note that when a file is located in the working directory, we can still use the full path if we want; the result would be exactly the same.
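The workflow above can be sketched in a self-contained way. The example below is an illustration under stated assumptions: a temporary directory stands in for the Data Sets folder, and example_churn.csv is a hypothetical two-row file created only for this purpose. It also relies on the fact that setwd() invisibly returns the previous working directory, which makes it easy to restore afterwards.

```r
# A temporary directory plays the role of the "Data Sets" folder
data_dir <- tempdir()

# Hypothetical two-row file, created only for this illustration
write.csv(data.frame(ID = 1:2, Churn = c(0L, 0L)),
          file.path(data_dir, "example_churn.csv"),
          row.names = FALSE)

# Change the working directory and keep the previous one
old_wd <- setwd(data_dir)

# Check that the file is visible from the working directory, then import it
# by file name only
file_found <- file.exists("example_churn.csv")
example_data <- read.csv("example_churn.csv")

# Restore the original working directory
setwd(old_wd)

file_found
head(example_data)
```

Checking file.exists() before the import is a cheap way to find out whether a failed import is caused by the working directory rather than by the file itself.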
The difference between a slash (‘/’) and a backslash (‘\’) in the context of a working directory primarily relates to their usage in different operating systems and how they denote paths in a file system:
- Slash (/) is commonly used in Unix-like operating systems (Linux, macOS) and URLs.
- Backslash (\) is primarily used in Windows operating systems.
However, in R, as in many other programming languages, the
backslash (\) is used as an escape character. This
means that when R sees a backslash, it expects it to be followed
by another character or sequence that represents a special
character or command (e.g., \n for newline).
Example:
- Incorrect (using a single backslash): "C:\Users\YourName\Documents"
- Correct (using double backslashes): "C:\\Users\\YourName\\Documents"
- Preferred (using forward slashes): "C:/Users/YourName/Documents"
For ease of use and to avoid errors with escape characters,
using forward slashes (/) in file paths is usually
the best option when working in R, irrespective of platform.
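To make the escape-character point concrete, here is a small sketch that stores a Windows-style path with doubled backslashes and converts it to forward slashes. The path is the illustrative one from the example above, not a real location. The single-backslash version is deliberately left out of the code: R would try to read \U as the start of an escape sequence and reject the string.

```r
# Doubled backslashes in the code produce single backslashes in the string
windows_path <- "C:\\Users\\YourName\\Documents"

# cat() prints the characters that are actually stored
cat(windows_path, "\n")

# Replace every backslash with a forward slash;
# fixed = TRUE treats the pattern as plain text, not as a regular expression
forward_path <- gsub("\\", "/", windows_path, fixed = TRUE)
forward_path
```

The converted string is the preferred form from the list above and can be passed directly to functions such as setwd() or read.csv().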
3.4 Importing Data in RStudio
Now that we understand “path” and “directory”, we can examine in
practice how to import a data set in R by using the RStudio
functionality. To keep things simple, we import the same data set
with the same full path as described. So, the csv file that we
import is called “Customer_Churn” and the directory path is
C:/Users/User/Desktop/Data Sets. As shown earlier, we can
use the read.csv() function to import this data set.
However, RStudio also provides a user-friendly functionality that can help us import our data sets in a relatively straightforward manner. By choosing File -> Import Dataset, we can see that RStudio provides us with different options, such as “From Text (readr)…”. By clicking this option, we will see the following dialog:
To find the file on our computer, we click the option Browse in the top right corner and browse to the file. When we find the file we want, we click on it and preview it in the table that appears:
Generally, we can see a number of options to change how RStudio
imports the file. In this way, we see exactly what RStudio will
import and we can make adjustments before the import
takes place. Lastly, we see the exact R code that performs the import
in the bottom right corner. This is very
valuable because not only can we use this code later, but we
can also learn how to import a data set by typing code directly in the
console. In this example we note that RStudio used the
readr package and the
read_csv() function to make this import possible. This
function can be thought of as an advanced version of the
read.csv() function. The details are not important at
this point; our goal is to capture the intuition about paths,
directories and the overall functionality of RStudio regarding data
import.
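For readers who want to compare the two functions, the sketch below imports the same small file with read.csv() and, if the readr package happens to be installed, with read_csv(). The file is a temporary one whose two rows come from the customer_churn data shown earlier; the exact code that RStudio generates in the import dialog may look different.

```r
# Temporary csv file holding two rows from the customer_churn data
tmp_file <- tempfile(fileext = ".csv")
writeLines(c("ID,Recency,Recency_Level",
             "1,46,Low",
             "2,40,Low"),
           tmp_file)

# Base R import: returns a data frame
base_result <- read.csv(tmp_file)
class(base_result)

# readr import, run only if the package is available on this computer;
# suppressMessages() hides the column specification that read_csv() prints
if (requireNamespace("readr", quietly = TRUE)) {
  readr_result <- suppressMessages(readr::read_csv(tmp_file))
  print(class(readr_result))   # a tibble, which is also a data frame
}
```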
3.5 Importing Data from other sources
It is possible to import data in R from various sources, including relational database platforms such as MySQL, as well as directly from web pages via URLs. Additionally, R can be used for web scraping, which involves extracting data from the HTML of web pages. Given the variety of data sources, it’s impractical to cover every possible method in detail. However, the core idea remains the same: we need to guide R to the location of the data and specify the appropriate function for importing it, as different file types require different functions.
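The core idea, "point R to the location and pick the function that matches the file type", can be sketched with a small helper. Note that import_by_extension() is a hypothetical function written for this illustration, not part of R or of any package, and it covers only three file types that base R can read on its own.

```r
# Hypothetical helper: choose the import function from the file extension
import_by_extension <- function(path) {
  extension <- tolower(tools::file_ext(path))
  switch(extension,
    csv = read.csv(path),     # comma-separated values
    tsv = read.delim(path),   # tab-separated values
    rds = readRDS(path),      # R's own serialized format
    stop("No import function defined for extension: ", extension)
  )
}

# Usage with temporary files created only for this illustration
sample_rows <- data.frame(ID = 1:2, Churn = c(0L, 0L))
csv_path <- tempfile(fileext = ".csv")
tsv_path <- tempfile(fileext = ".tsv")
write.csv(sample_rows, csv_path, row.names = FALSE)
write.table(sample_rows, tsv_path, sep = "\t", row.names = FALSE)

csv_data <- import_by_extension(csv_path)
tsv_data <- import_by_extension(tsv_path)
```

Other sources, such as Excel workbooks or databases, would need additional packages and are left out of this sketch; for an unknown extension the helper stops with an error instead of guessing.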